Abstract: This study applies topic modeling to Twitter data about the Kanjuruhan tragedy, a trending 
topic following a fatal incident that occurred after a football match at Kanjuruhan Stadium in Malang, 
Indonesia. The research uses Latent Dirichlet Allocation (LDA), a text mining method that finds 
patterns in a collection of documents by producing several distinct topics. The data consist of 1480 
pre-processed tweets in the Indonesian language. The modeling produced five main topics related to the 
Kanjuruhan tragedy: the PSSI (Indonesian Football Association) investigation, suspects, the Itaewon 
tragedy, Korean netizens (Knetz), and tear gas. This research provides information about the comments 
and expectations of Twitter users regarding the Kanjuruhan tragedy and offers considerations for 
stakeholders. 
1. Introduction 


Twitter is a social media platform widely used by the public to express opinions and comments on 
popular issues. It provides an online social networking and microblogging service that enables its 
users to send and read text-based messages. Public comments on Twitter form a huge body of text data 
that can be mined and analyzed. Topic modeling can be applied to uncover hidden topics in such a corpus 
(a collection of natural-language texts); the most popular approaches are Latent Dirichlet Allocation 
(LDA) and Latent Semantic Analysis (LSA). Each topic represents a group of comments that discuss the 
same context. 


Topic modeling has been applied in various fields, such as bioinformatics [1] and transportation [2]. 
Several previous studies have applied LDA topic modeling to Twitter data [3][4][5]. Another study used 
LDA to determine the topics of Indonesian-language tweets about football news and obtained topics such 
as pre-match analysis, live match updates, and football club achievements [6]. 


This study applies topic modeling to tweet data about the tragedy at Kanjuruhan Stadium, a fatal 
incident in which more than a hundred football spectators died. The data were collected from Twitter 
for October 2022. We use Latent Dirichlet Allocation (LDA) as the topic modeling method to determine 
which topics appear on Twitter. The remainder of this paper is organized as follows: Section 2 
describes the materials and methods, Section 3 presents the results and discussion, and Section 4 
gives the conclusions. 


2. Materials and Methods 
2.1. Twitter Datasets 


Twitter is a website owned and operated by Twitter, Inc. that offers a social network in the form of a 
microblog. The site allows users to send and read short messages, known as tweets, which are displayed 
on the user's profile page; tweets were originally limited to 140 characters, a limit later raised to 
280. Twitter has its own writing conventions, with special symbols and rules. As one of the most 
popular social media platforms, Twitter makes it very easy for its users to access information and 
voice their opinions [7]. 


The use of Twitter soars when an event attracts public attention, as the Kanjuruhan tragedy did. 
Tweets related to the tragedy numbered in the thousands and came from many user accounts. This study 
retrieved 2000 tweets about the Kanjuruhan tragedy posted in October 2022. After duplicates were 
removed, 1480 tweets remained. 
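
The duplicate-removal step can be sketched as follows. This is a minimal illustration under an 
assumption: the study does not state how duplicates were identified, so the sketch treats two tweets 
as duplicates when their text matches after lowercasing and whitespace normalization. The function 
name and the sample tweets are invented for the example. 

```python
def drop_duplicate_tweets(tweets):
    """Return the tweets with duplicates removed, keeping first occurrences."""
    seen = set()
    unique = []
    for text in tweets:
        # Assumed duplicate criterion: same text ignoring case and extra whitespace
        key = " ".join(text.split()).lower()
        if key not in seen:
            seen.add(key)
            unique.append(text)
    return unique


# Toy data, invented for illustration (not taken from the study's dataset)
raw_tweets = [
    "Usut tuntas tragedi Kanjuruhan",
    "usut tuntas  tragedi kanjuruhan",
    "Doa untuk korban tragedi Kanjuruhan",
]
unique_tweets = drop_duplicate_tweets(raw_tweets)
print(len(raw_tweets), "->", len(unique_tweets))
```

Keeping the first occurrence preserves the original order of the tweets, which makes the result 
reproducible from the raw export. 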


2.2. Topic Modelling 


Topic modeling is built on three entities: "word", "document", and "corpus". A "word" is the basic 
unit of discrete data, defined as an item from a vocabulary in which every unique word is indexed. A 
"document" is a sequence of N words. A "corpus" is a collection of M documents ("corpora" is the 
plural form). A "topic" is a distribution over a fixed vocabulary. Each document in the corpus has its 
own proportions of the topics, according to the words it contains. Topic modeling has attracted 
interest from researchers in text mining, natural language processing, and machine learning [1]. 
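
The entities above map directly onto simple data structures. The following sketch uses an invented 
three-document corpus; the variable and function names are illustrative and do not come from the 
study's code. 

```python
from collections import Counter

# Toy corpus of M = 3 documents, each a sequence of N words (invented example)
corpus = [
    ["pssi", "usut", "tragedi", "kanjuruhan"],
    ["gas", "air", "mata", "stadion", "kanjuruhan"],
    ["pssi", "usut", "tuntas", "pssi"],
]

# Vocabulary: every unique word in the corpus receives an integer index
unique_words = sorted({word for doc in corpus for word in doc})
vocabulary = {word: index for index, word in enumerate(unique_words)}


def to_bag_of_words(doc):
    """Count how often each vocabulary index occurs in one document."""
    counts = Counter(vocabulary[word] for word in doc)
    return sorted(counts.items())


bags = [to_bag_of_words(doc) for doc in corpus]
print(vocabulary)
print(bags)
```

Each document becomes a list of (word index, count) pairs, which discards word order and keeps only 
frequencies. 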


The purpose of topic modeling is to automatically discover the hidden structure of a set of documents: 
the topics, the distribution of topics in each document, and the assignment of a topic to each word in 
each document. Topic modeling uses the observed documents to infer this hidden structure. The number 
of topics to be generated is fixed before the modeling process is carried out [2]. 

2.3. Latent Dirichlet Allocation 


Latent Dirichlet Allocation (LDA) is a generative probabilistic model for collections of discrete data 
such as a set of documents (a text corpus). In the context of text modeling, the topic probabilities 
provide an explicit representation of a document. The basic idea of LDA is that a document consists of 
several topics. The model is generative in that it assumes an imaginary random process in which each 
document is produced from a mixture of topics and each topic is a distribution over words. The LDA 
concept is shown in Figure 1 [8]. 
Fig. 1. LDA concept [9] 


LDA is also described as a text mining method for finding patterns in documents by generating several 
different topics [10]. LDA was chosen because it can analyze large collections of documents. It uses 
the bag-of-words representation to identify hidden topic information in large sets of documents [11]. 
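
The generative view of LDA can be illustrated with a short simulation. This is a sketch, not the model 
fitted in this study: the two topics, the four-word vocabulary, and the Dirichlet parameter alpha are 
hypothetical values chosen for the example. 

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one sample from a Dirichlet distribution via normalized gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(topic_word, alpha, vocab, n_words, rng):
    """Generate one document following the LDA generative process."""
    # Step 1: draw the topic proportions of this document
    theta = sample_dirichlet(alpha, rng)
    words = []
    for _ in range(n_words):
        # Step 2: draw a topic for this word position from the document's proportions
        z = rng.choices(range(len(topic_word)), weights=theta)[0]
        # Step 3: draw the word from the chosen topic's word distribution
        words.append(rng.choices(vocab, weights=topic_word[z])[0])
    return theta, words


rng = random.Random(0)
vocab = ["pssi", "usut", "gas", "mata"]
# Two hypothetical topics, each a probability distribution over the vocabulary
topic_word = [
    [0.45, 0.45, 0.05, 0.05],
    [0.05, 0.05, 0.45, 0.45],
]
theta, words = generate_document(topic_word, [0.5, 0.5], vocab, 10, rng)
print(theta, words)
```

Fitting an LDA model reverses this process: given only the observed words, it infers the topic 
proportions per document and the word distribution per topic. 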


2.4. Python and Google Colab 


Google Colab is a hosted notebook environment in which programs can be written, stored, run, and 
shared via Google Drive. Its notebook format is user-friendly and supports common data science and 
machine learning needs. It is similar to Jupyter Notebook but runs in the cloud and is accessed 
through a web browser such as Google Chrome. Python is a popular programming language used in the 
Google Colab environment. It is open source, easy to use, and has many supporting libraries for data 
science and machine learning. For example, [6] and [12] carried out text pre-processing for topic 
modeling in Python. 


2.5. Modelling Stages 


The stages of topic modeling in this study are shown in Figure 2: data collection, data 
pre-processing, topic modeling, visualization, and analysis of the results. 
Fig. 2. Modelling stages 
A. Data Collection 


Data were retrieved using the Twint library with the keyword "Tragedi Kanjuruhan", language "id", a 
limit of 2000 tweets, and the period 1 October to 31 October 2022. The tweets came from a range of 
accounts, including official accounts such as detik.com, jawapos.com, and hariankompas.com. 


B. Data Pre-processing 


The pre-processing stage consists of data cleaning, attribute selection, case folding (converting 
text to lowercase), tokenizing (splitting text into tokens and removing unnecessary characters or 
symbols), stopword removal (removing words that carry little meaning), normalization (replacing 
certain words with more appropriate forms, such as jatim with Jawa Timur), and stemming (stripping 
affixes from words, using the Sastrawi and Swifter packages). 
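
The pre-processing steps can be sketched as a single function. This is an illustration under stated 
assumptions rather than the study's implementation: the stopword list and the normalization map are 
tiny hypothetical samples, the regular expressions are one possible cleaning rule, and stemming is 
left as a pluggable argument (an identity function by default) so that the sketch runs without the 
Sastrawi package. 

```python
import re

# Hypothetical resources, kept tiny for illustration
STOPWORDS = {"yang", "di", "dan", "itu", "ini"}
NORMALIZATION = {"jatim": "jawa timur"}


def preprocess(tweet, stem=lambda token: token):
    """Clean one tweet and return its list of tokens."""
    text = tweet.lower()                              # case folding
    text = re.sub(r"https?://\S+|@\w+", " ", text)    # drop URLs and mentions
    text = re.sub(r"[^a-z\s]", " ", text)             # drop digits, punctuation, symbols
    tokens = text.split()                             # tokenizing
    tokens = [t for t in tokens if t not in STOPWORDS]  # stopword removal
    normalized = []
    for token in tokens:
        # Normalization may expand one token into several words
        normalized.extend(NORMALIZATION.get(token, token).split())
    return [stem(token) for token in normalized]      # stemming (identity by default)


doc = preprocess("Tragedi di Stadion Kanjuruhan, Jatim! https://t.co/abc @PSSI")
print(doc)
```

To apply Indonesian stemming, a stemmer's stem method, such as the one offered by Sastrawi, could be 
passed as the stem argument. 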


C. LDA Topic Modelling 


LDA topic modeling in this study uses the LdaModel class provided by the Gensim library for Python 
[6]. We set the number of topics to five; the following is a snippet of the modeling code. 


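import gensim 
from gensim import corpora 

# doc_clean holds the pre-processed tweets as a list of token lists 
Lda = gensim.models.LdaModel 

# Map each unique token to an integer id 
dictionary = corpora.Dictionary(doc_clean) 

# Represent each document as a bag of words: (token id, count) pairs 
bow_corpus = [dictionary.doc2bow(doc) for doc in doc_clean] 

total_topics = 5  # number of topics 
number_words = 8  # number of words shown per topic 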
D. Visualization 


The results of LDA topic modeling are visualized using the Gensim library and pyLDAvis in Python. 
pyLDAvis is a Python port of LDAvis, a web-based interactive topic model visualization built with a 
combination of R and D3 [13]. The pyLDAvis library is used to browse the relationships between topics 
and terms in order to understand the LDA model. Its display has two panels: a distance map of the 
topics and a bar chart of the most representative terms, including those found most frequently in the 
corpus. The following is a snippet of the visualization code. 


import os 
import pickle 

import pyLDAvis 
import pyLDAvis.gensim  # in recent pyLDAvis releases this module is named pyLDAvis.gensim_models 

# Visualize the topics 
pyLDAvis.enable_notebook() 

LDAvis_data_filepath = os.path.join('ldavis_prepared_' + str(total_topics)) 

corpus = [dictionary.doc2bow(text) for text in doc_clean] 

# Prepare the visualization data from the fitted model 
LDAvis_prepared = pyLDAvis.gensim.prepare(lda_model, corpus, dictionary) 

# Save the prepared data so it can be reloaded later 
with open(LDAvis_data_filepath, 'wb') as f: 
    pickle.dump(LDAvis_prepared, f) 

3. Results and Discussion 


This section discusses the results of topic modeling. The modeling produced five main topics related 
to the Kanjuruhan tragedy: the Itaewon tragedy (topic #1), the PSSI investigation (topic #2), suspects 
(topic #3), Korean netizens/Knetz (topic #4), and tear gas (topic #5). Table 1 shows the results of 
the bag-of-words weighting. We report eight words per topic (K = kata = word), with common words 
translated into English. 


Table 1. The results of the bag of words weighting 


Topic  K1         K2           K3       K4         K5          K6        K7       K8 
#1     itaewon    october      victim   halloween  people      dead      stadium  closed 
       (63%)      (40%)        (31%)    (16%)      (13%)       (12%)     (8%)     (8%) 
#2     indonesia  investigate  pssi     ball       thoroughly  suporter  stay     soccer 
       (23%)      (15%)        (14%)    (14%)      (14%)       (12%)     (12%)    (9%) 
#3     suspect    kapolri      pssi     permanent  lib         pt        malang   director 
       (57%)      (26%)        (23%)    (20%)      (19%)       (16%)     (14%)    (12%) 
#4     country    knetz        lu       indo       people      gw        itaewon  yesterday 
       (22%)      (15%)        (10%)    (10%)      (10%)       (10%)     (8%)     (8%) 
#5     eye        people       itaewon  pas        gas         water     hope     victim 
       (19%)      (17%)        (12%)    (12%)      (11%)       (11%)     (10%)    (10%) 


Figure 3, one of the visualization outputs, shows the distance map between the topics of this model 
together with a bar chart of the 30 most prominent words in the corpus for the topic "Itaewon". The 
map shows five topic clusters that can be grouped independently. The distance between the clusters 
indicates that the distribution and frequency of words within each topic are distinctive. The word 
"Itaewon" appeared at the top because many Indonesian netizens replied to comments by Korean netizens, 
many of whom had earlier commented on the Kanjuruhan tragedy. Another example of topic visualization 
is shown in Figure 4. 


Fig. 3. LDA visualization (topic #1 - the Itaewon tragedy) 


Figure 4 shows the 30 most prominent words in the corpus on the topic "suspects". The bar chart in Figure 
4 illustrates the distribution of words that refer to the topic of the suspect in the Kanjuruhan tragedy. 
Fig. 4. LDA visualization (topic #2 - the PSSI investigation) 
4. Conclusion 


The modeling produced five main topics related to the Kanjuruhan tragedy: the PSSI (Indonesian 
Football Association) investigation, suspects, the Itaewon tragedy, Korean netizens (Knetz), and tear 
gas. The word "Itaewon" appeared at the top because many Indonesian netizens replied to comments by 
Korean netizens, many of whom had earlier commented on the Kanjuruhan tragedy. 


This research provides information about the comments and expectations of Twitter users regarding the 
Kanjuruhan tragedy and offers considerations for stakeholders. The study can still be improved, for 
example by using coherence scores to determine the number of topics. Future work could also examine 
the performance of LDA in extracting topics from Indonesian-language text documents by comparing it 
with other, non-topic-based methods. 